Setup

This is an R Markdown workbook! It is a way to combine text and plots in one clean, easy-to-read document. Chunks like this one compile as text! Chunks between back ticks (like below) are passed to R. Lines preceded by a hash mark are presented as section headers.

This is the completed version of the workbook. The student workbook has the same code with some stuff removed and replaced with ??, to encourage active learning.

## ^ you can make the code chunk appear in the document by flagging as include = T

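### load the packages used in this workbook (lexdec ships with Baayen's languageR package)
library(languageR)
library(ggplot2)
library(dplyr)

### use lexical decision data from Harald Baayen: get it, open the help file for it, and look at it
data(lexdec)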
?lexdec
## starting httpd help server ... done
summary(lexdec)
##     Subject           RT            Trial     Sex      NativeLanguage
##  A1     :  79   Min.   :5.829   Min.   : 23   F:1106   English:948   
##  A2     :  79   1st Qu.:6.215   1st Qu.: 64   M: 553   Other  :711   
##  A3     :  79   Median :6.346   Median :106                          
##  C      :  79   Mean   :6.385   Mean   :105                          
##  D      :  79   3rd Qu.:6.502   3rd Qu.:146                          
##  I      :  79   Max.   :7.587   Max.   :185                          
##  (Other):1185                                                        
##       Correct        PrevType      PrevCorrect          Word     
##  correct  :1594   nonword:855   correct  :1542   almond   :  21  
##  incorrect:  65   word   :804   incorrect: 117   ant      :  21  
##                                                  apple    :  21  
##                                                  apricot  :  21  
##                                                  asparagus:  21  
##                                                  avocado  :  21  
##                                                  (Other)  :1533  
##    Frequency       FamilySize      SynsetCount         Length      
##  Min.   :1.792   Min.   :0.0000   Min.   :0.6931   Min.   : 3.000  
##  1st Qu.:3.951   1st Qu.:0.0000   1st Qu.:1.0986   1st Qu.: 5.000  
##  Median :4.754   Median :0.0000   Median :1.0986   Median : 6.000  
##  Mean   :4.751   Mean   :0.7028   Mean   :1.3154   Mean   : 5.911  
##  3rd Qu.:5.652   3rd Qu.:1.0986   3rd Qu.:1.6094   3rd Qu.: 7.000  
##  Max.   :7.772   Max.   :3.3322   Max.   :2.3026   Max.   :10.000  
##                                                                    
##     Class      FreqSingular      FreqPlural     DerivEntropy   
##  animal:924   Min.   :   4.0   Min.   :  0.0   Min.   :0.0000  
##  plant :735   1st Qu.:  23.0   1st Qu.: 19.0   1st Qu.:0.0000  
##               Median :  69.0   Median : 49.0   Median :0.0370  
##               Mean   : 132.1   Mean   :109.7   Mean   :0.3856  
##               3rd Qu.: 146.0   3rd Qu.:132.0   3rd Qu.:0.6845  
##               Max.   :1518.0   Max.   :854.0   Max.   :2.2641  
##                                                                
##     Complex         rInfl             meanRT         SubjFreq    
##  complex: 210   Min.   :-1.3437   Min.   :6.245   Min.   :2.000  
##  simplex:1449   1st Qu.:-0.3023   1st Qu.:6.322   1st Qu.:3.160  
##                 Median : 0.1900   Median :6.364   Median :3.880  
##                 Mean   : 0.2845   Mean   :6.379   Mean   :3.911  
##                 3rd Qu.: 0.6385   3rd Qu.:6.420   3rd Qu.:4.680  
##                 Max.   : 4.4427   Max.   :6.621   Max.   :6.040  
##                                                                  
##     meanSize       meanWeight          BNCw               BNCc        
##  Min.   :1.323   Min.   :0.8244   Min.   : 0.02229   Min.   : 0.0000  
##  1st Qu.:1.890   1st Qu.:1.4590   1st Qu.: 1.64921   1st Qu.: 0.1625  
##  Median :3.099   Median :2.7558   Median : 3.32071   Median : 0.6500  
##  Mean   :2.891   Mean   :2.5516   Mean   : 7.37800   Mean   : 5.0351  
##  3rd Qu.:3.711   3rd Qu.:3.4178   3rd Qu.: 7.10943   3rd Qu.: 2.9248  
##  Max.   :4.819   Max.   :4.7138   Max.   :79.17324   Max.   :83.1949  
##                                                                       
##       BNCd           BNCcRatio         BNCdRatio     
##  Min.   :  0.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:  1.188   1st Qu.:0.09673   1st Qu.:0.5551  
##  Median :  3.800   Median :0.27341   Median :0.9349  
##  Mean   : 12.995   Mean   :0.45834   Mean   :1.5428  
##  3rd Qu.: 10.451   3rd Qu.:0.55550   3rd Qu.:2.1315  
##  Max.   :241.561   Max.   :8.29545   Max.   :6.3458  
## 
## and once you run a chunk that includes output, you can minimise it again by hitting the double-arrow in the top right.

Ins and outs of ggplot: building a plot from the ground up.

An idea borrowed from http://joeystanley.com/blog/making-vowel-plots-in-r-part-1

Let’s build a ggplot object from the beginning…

#open a plot window
ggplot()

Add data to it! There is still no obvious output.

ggplot(data=lexdec)

To get the plot to show something, we tell ggplot which variables to map to which axes, using aes (aesthetics) arguments.

As a shorthand, the first argument will be read as the data. You don’t have to specify it.

This plot window now has X and Y axes.

ggplot(lexdec,aes(y=RT,x=Frequency))

Points & lines!

We add layers to the plot with ‘geom’ functions, which are literally added on to the command with +.

ggplot(lexdec,aes(y=RT,x=Frequency))+
  geom_point()

We can also save the object at any stage of the process, and then add to it as we go.

pl1 <- ggplot(lexdec,aes(y=RT,x=Frequency))

pl1 + geom_point()
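
Saving works because adding a layer returns a new plot object and leaves the saved one untouched. Here is a minimal sketch of that behaviour, using the built-in mtcars data as a stand-in for lexdec (the names base_plot and with_points are made up for illustration):

```r
library(ggplot2)

# A saved plot with data and aesthetics, but no layers yet.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Adding a geom returns a new object; base_plot itself is not modified.
with_points <- base_plot + geom_point()

length(base_plot$layers)    # layers in the saved object
length(with_points$layers)  # layers after adding geom_point()
```

This is why pl1 can be reused with different geoms as often as you like.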

Add a smoothed line (here, a straight-line fit from a linear model):

pl1 + geom_smooth(method='lm')

Add points & a smoothed line!

pl1 + geom_point() + geom_smooth(method='lm')

There are many smooth options. Search geom_smooth docs online for details.

Two more:

Loess smooth (a local regression)

pl1 + geom_point() + geom_smooth(method='loess')

Generalized additive model

pl1 + geom_point() + geom_smooth(method='gam')
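
To see what method='lm' actually draws, you can pull the smoothed values back out of the plot with layer_data() and compare them with an ordinary lm() fit. This is only a sketch on simulated data (toy, x and y are invented here, not taken from lexdec), and the explicit formula = y ~ x just spells out the default:

```r
library(ggplot2)

# Simulated data, for illustration only.
set.seed(1)
toy <- data.frame(x = runif(50, min = 2, max = 8))
toy$y <- 6.6 - 0.04 * toy$x + rnorm(50, sd = 0.1)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth; its x/y columns are the points along the drawn line.
smooth_vals <- layer_data(p, 2)

# Fit the same model by hand and predict at the same x positions.
fit <- lm(y ~ x, data = toy)
lm_vals <- unname(predict(fit, newdata = data.frame(x = smooth_vals$x)))

all.equal(smooth_vals$y, lm_vals)
```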

Fixing overplotted points

In the plots we made above, there are a lot of points on top of each other!

To fix, use geom_jitter instead of geom_point. geom_jitter is like geom_point, but has width and height parameters for randomly moving points horizontally and vertically.

pl1 + geom_jitter(width=.1,height=.1)

Try with different levels of jitter… This is too much.

pl1 + geom_jitter(width=1,height=1)

Try with adjustment to alpha (point transparency) instead.

pl1 + geom_point(alpha=.1)
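
Because jitter is random, the points land somewhere slightly different every time you knit. One way to make the plot reproducible is to pass a seed through position_jitter(). A small sketch with invented data (toy and the seed value 42 are arbitrary choices for illustration):

```r
library(ggplot2)

# Invented grid of heavily overplotted points.
toy <- data.frame(x = rep(1:5, each = 4), y = rep(1:4, times = 5))

jittered <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.1, seed = 42))

# Building the plot twice gives the same jittered positions.
first_draw  <- layer_data(jittered)
second_draw <- layer_data(jittered)
identical(first_draw$x, second_draw$x)

# Largest horizontal shift; it stays within the requested width.
max(abs(first_draw$x - toy$x))
```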

Add it all together!

An excellent feature of ggplot is that you can add on aes values for any number of things at once. Let’s now add to the above plot, with lines for native language and shapes for whether the trial was correct.

We will write this out in full, so that we can specify the aes directly in the ggplot call. I like this plot, but want to add some color to it.

ggplot(lexdec,aes(y=RT,x=Frequency,linetype=NativeLanguage,shape=Correct)) +
  geom_point(alpha=.5)+
  geom_smooth(method='lm')

Adding & changing colors

Adding color is a good way to make things visually distinct. Here, I’m also suppressing the se bands on the smooths.

pl2 <- ggplot(lexdec,aes(y=RT,x=Frequency, shape=Correct, lty=Correct, color=NativeLanguage)) +
  geom_point(alpha=.5)+
  geom_smooth(method='lm',se=F)

pl2

Color is also available for things with lots of levels…but it can get hard to read!

pl3 <- ggplot(lexdec,aes(y=RT,x=Frequency,
          color=Subject,lty=NativeLanguage)) +
  geom_point(alpha=.5)+
  geom_smooth(method='lm',se=F)

pl3

Facets

We can add panels to a plot with facet_grid or facet_wrap.

pl2 + facet_grid(.~NativeLanguage)

pl2 + facet_grid(NativeLanguage~.)

Facet grid makes a little grid of panels (rows~columns)

pl2 + facet_grid(Correct~NativeLanguage)

Facet_wrap wraps around… This is useful for something with lots of levels.

pl3 + facet_wrap(~Subject,ncol=7) ## can specify ncol(umns) and nrow(s)
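
A quick way to check how many panels a facet formula produces is to count the PANEL values in the built plot. The sketch below uses an invented data frame (toy, lang and correct are stand-ins for the lexdec columns):

```r
library(ggplot2)

# Invented data: 2 languages x 2 outcomes, 5 points in each combination.
toy <- expand.grid(lang = c("English", "Other"),
                   correct = c("correct", "incorrect"),
                   rep = 1:5)
toy$x <- seq_len(nrow(toy))
toy$y <- rev(toy$x)

p <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# facet_grid(rows ~ columns): one panel per combination of the two variables.
grid_panels <- layer_data(p + facet_grid(correct ~ lang))$PANEL
length(unique(grid_panels))

# facet_wrap(~ variable): one panel per level, wrapped into rows and columns.
wrap_panels <- layer_data(p + facet_wrap(~lang, ncol = 2))$PANEL
length(unique(wrap_panels))
```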

Labels

We can change the labels and scales on x and y axes!

pl2 + scale_x_continuous(name="Word Frequency") +
  scale_y_continuous(name="Lexical Decision RT")

pl2+scale_x_continuous(name="Word Frequency",limits=c(1,10))+
scale_y_continuous(name="Lexical Decision RT",limits=c(5.5,8))
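
One thing to watch with limits: setting them on a scale drops any data outside the range before smooths are computed, whereas coord_cartesian() only zooms the view. The limits used above are wide enough to hold all of the lexdec values, so nothing is lost there. The sketch below shows the difference on simulated data (toy and the limits c(0, 10) are invented for illustration):

```r
library(ggplot2)

# Simulated data where y rises well past 10.
set.seed(2)
toy <- data.frame(x = 1:40)
toy$y <- 0.5 * toy$x + rnorm(40)

base <- ggplot(toy, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Fitted line using all the data.
full <- layer_data(base)

# Zooming with coord_cartesian leaves the fitted line unchanged.
zoomed <- layer_data(base + coord_cartesian(ylim = c(0, 10)))

# Scale limits remove out-of-range rows first, so the fit only sees small x.
clipped <- suppressWarnings(
  layer_data(base + scale_y_continuous(limits = c(0, 10)))
)

all.equal(full$y, zoomed$y)
max(clipped$x, na.rm = TRUE) < max(full$x)
```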

We can also add a main title

pl2 + 
 ggtitle("Lexical decision RT predicted by word frequency and correctness ") 

## titles can include line breaks (\n)
pl2 + 
 ggtitle("Lexical decision RT \n predicted by word frequency and correctness")

## and we can center them!
pl2 + 
  ggtitle("Lexical decision RT \n predicted by word frequency and correctness")+
 theme(plot.title = element_text(hjust = 0.5))

Bubble plots

A bubble plot scales the size of each point at (x, y) by a third variable z. This can be either continuous or discrete.

Change the point size by morphological family size (which co-varies with word frequency: more frequent words come from more common morphological families). The variable is called FamilySize.

ggplot(lexdec,aes(y=RT,x=Frequency,size=FamilySize))+
  geom_point(alpha=.1)

Adding trend lines

Add a horizontal reference line at the mean RT

mRT <- mean(lexdec$RT)

pl4 <- ggplot(lexdec,aes(y=RT,x=Frequency,size=FamilySize))+
  geom_point(alpha=.1)

pl4 +  geom_hline(yintercept=mRT,color='red')

Add a vertical reference line at the mean frequency

mF <- mean(lexdec$Frequency)

pl4 +  geom_vline(xintercept=mF,color='red')

Color points based upon mean value

lexdec$RTBin <- as.factor( ifelse(lexdec$RT > mRT, 2,1) )

ggplot(lexdec,aes(y=RT,x=Frequency,size=FamilySize,color=RTBin))+
  geom_point(alpha=.1)+
  geom_hline(yintercept=mRT,color='red')
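
The legend for RTBin will show the codes 1 and 2, which are not very informative. A possible alternative is to build the factor with readable labels. A minimal sketch on invented numbers (rt, m_rt and rt_bin are illustrative names, and the values are not from lexdec):

```r
# Invented log RTs, for illustration only.
rt <- c(6.1, 6.5, 6.3, 6.9)
m_rt <- mean(rt)

# Label each value relative to the mean, with a fixed level order.
rt_bin <- factor(ifelse(rt > m_rt, "above mean", "below mean"),
                 levels = c("below mean", "above mean"))

table(rt_bin)
```

The same idea applies to lexdec$RT and mRT; the legend then reads "below mean" and "above mean".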

Adding point labels

Let’s tabulate the data to get average RTs by word.

#Do tabulations using dplyr:
mbyWord <- lexdec %>% group_by(Frequency,Word)%>% 
  summarise(meanRT=mean(RT))

ggplot(mbyWord,aes(x=Frequency,y=meanRT,label=Word)) +
 geom_text()
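
Two optional tweaks to this pattern, sketched on an invented mini data set (toy and its values are made up): .groups = "drop" returns an ungrouped table, assuming dplyr 1.0 or later, and check_overlap = TRUE tells geom_text to skip labels that would sit on top of earlier ones.

```r
library(dplyr)
library(ggplot2)

# Invented data: three words, three trials each.
toy <- data.frame(
  Word      = rep(c("ant", "apple", "owl"), each = 3),
  Frequency = rep(c(5.3, 7.2, 4.9), each = 3),
  RT        = c(6.2, 6.4, 6.3, 6.1, 6.0, 6.2, 6.5, 6.6, 6.4)
)

# One row per word, with the grouping dropped afterwards.
by_word <- toy %>%
  group_by(Frequency, Word) %>%
  summarise(meanRT = mean(RT), .groups = "drop")

# Labels that would overlap an earlier label are not drawn.
ggplot(by_word, aes(x = Frequency, y = meanRT, label = Word)) +
  geom_text(check_overlap = TRUE)
```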

Heat maps

Let’s make a heat map. These need a data frame with one row per x-by-y cell, containing a z value for each cell. Let’s try a different visualization of RT for correct and error trials by participant, split by their native language.

#Do tabulations using dplyr:
corRT <- lexdec %>% group_by(Subject,Correct,NativeLanguage)%>% 
  summarise(meanRT=mean(RT))

ggplot(corRT,aes(x=Correct,y=Subject)) +
  geom_tile(aes(fill=meanRT)) +
  facet_grid(~NativeLanguage)

## we can change the color map to be a little more useful
ggplot(corRT,aes(Correct,Subject)) +
  geom_tile(aes(fill=meanRT)) +
  scale_fill_gradient(low="yellow",high="red")

Customizing colors and line types

Let’s build our own color map that assigns colors to participants based upon their native language. It also serves as a little bit of a coding exercise.

## let's make a new variable that we can use to cluster together all the English / Other subs. Paste together the native language of the person, and their subject ID
lexdec$Subj2 <- paste(lexdec$NativeLanguage,
                      lexdec$Subject,sep=" ")

## to make a new color palette
## get a list of all English speakers 
# & find out how big it is
## there are lots of ways to get this number-- here is one of them!
eng<- lexdec %>% filter(NativeLanguage=="English") %>%
  group_by(Subject) %>% summarise()
dim(eng)[1]
## [1] 12
## 12 levels, so we need 12 shades of blue
## I'm asking a function to create 12 divisions of blue colors
## between two named endpoints
## r has lots of named colors!-- see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
## you can also use Hexadecimal colors (like in this cols list)

blues<-colorRampPalette(colors=c("steelblue1","darkblue"))(12)

## now get a list of all other speakers & find out its size
oth <- lexdec %>% filter(NativeLanguage=="Other") %>%
  group_by(Subject) %>% summarise()
dim(oth)[1]
## [1] 9
## 9 levels, so we need 9 shades of pink
pinks <- colorRampPalette(colors=c("pink","magenta"))(9)

## concatenate the two sets of colors, making one list
colors <- c(blues,pinks)

## use this as our color palette, with the sorted subject variable
ggplot(lexdec,aes(y=RT,x=Frequency,color=Subj2,
                  lty=NativeLanguage)) +
  geom_point(alpha=.5)+
  geom_smooth(method='lm',se=F)+
  scale_color_manual(values=colors)
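
The palette above lines up with Subj2 only because its levels sort alphabetically, with all the English subjects before the Other ones. A more explicit option is to name the color vector by level. Here is a sketch with an invented four-subject table (subj, keys and pal are illustrative names):

```r
# Invented subject table, standing in for the unique subjects in lexdec.
subj <- data.frame(
  Subject        = c("A1", "A2", "C", "D"),
  NativeLanguage = c("English", "Other", "English", "Other")
)
subj$Subj2 <- paste(subj$NativeLanguage, subj$Subject, sep = " ")

# Sorted level names: English subjects first, then Other.
keys  <- sort(unique(subj$Subj2))
n_eng <- sum(startsWith(keys, "English"))
n_oth <- sum(startsWith(keys, "Other"))

# One ramp per language group, then attach the level names.
pal <- c(colorRampPalette(c("steelblue1", "darkblue"))(n_eng),
         colorRampPalette(c("pink", "magenta"))(n_oth))
names(pal) <- keys
pal
```

Passing a named vector like pal to scale_color_manual(values = pal) matches colors to levels by name instead of by position.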

That legend is very hard to read. Fix it by suppressing the legend for color. We’ll do this in two steps.

pl7 <- ggplot(lexdec,aes(y=RT,x=Frequency,
         color=Subj2,lty=NativeLanguage)) +
  geom_point(alpha=.5)+
  geom_smooth(method='lm',se=F)+
  ## suppress the legend for color
  scale_color_manual(values=colors,guide="none") 

pl7

## and now redo the legend for line type to include color
# use guides() with the override.aes argument of guide_legend() to do this
pl7<- pl7 + guides(linetype = guide_legend(override.aes = 
            list(color = c("blue","magenta")) ))
pl7

Scale_XXX_manual can be used to change all types of parameters, such as legends, or line types.

pl7 + scale_linetype_manual(name="Native Language",values=c(1,4))  ## these are built-in line types

I also want to change point shapes. Here’s a list of point shape (pch) values: http://www.sthda.com/english/wiki/ggplot2-point-shapes

pl8 <- ggplot(lexdec,aes(y=RT,x=Frequency,color=Subj2,
     shape=NativeLanguage)) +
  geom_point(alpha=.5)+
  geom_smooth(method='lm',se=F)+
  facet_grid(~NativeLanguage)+
  scale_color_manual(values=colors,guide="none")+
  scale_shape_manual(values=c(1,18),guide="none")
pl8

Change some more labels

#set up a plot
pl8b <- ggplot(lexdec,aes(y=RT,x=Frequency,color=Subj2,
       shape=NativeLanguage,linetype=NativeLanguage)) +
       facet_grid(~NativeLanguage)+
       geom_point(alpha=.5)+
       geom_smooth(method='lm',se=F)+
       scale_color_manual(values=colors,guide="none")
  
## if we change the legend title for line type, we have to give the shape legend the same title, or ggplot draws two separate legends
### compare this...  
pl8b +  scale_linetype_manual(name="Native Language",values=c(1,4))+
        guides(linetype = guide_legend(override.aes = 
               list(color = c("blue","magenta")) ))

## with this...
pl8b +  scale_linetype_manual(name="Native Language",values=c(1,4))+
        scale_shape_manual(name="Native Language", values=c(1,18))+
        guides(linetype = guide_legend(override.aes = 
            list(color = c("blue","magenta")) ))

Compile your code up until now (hit ‘knit’)… does it work?

You can also knit to PDF (if you have LaTeX installed) or to Word (if you have it). Use the down-arrow by the knit button.

In-Class Exercise

Make some plots using the tools we have developed today.

Option 1: Pick a new variable in the lexdec dataset. Use the tools we built today to understand what it does. Run the code ?lexdec to get more info!

Option 2: Pick a new data set, and run one of the same plots we created above with the new variables. These are some interesting data sets you have in active R libraries. Run the code ?data_set_name to get more info, where the options below are data_set_name

Next week

We will cover the distribution plots: bar plots + error bars, violin plots, beeswarms, and ‘ridges’ (joyplots). Plus, anything else you guys want covered.

Send email to laubre@mpi.nl if you have requests for a plot type, or a specific type of data to cover.